Face tracking with rotations using Mediapipe¶


Wim Pouw (wim.pouw@donders.ru.nl)

isolated

isolated

Info documents¶

Mediapipe is a versatile package with very user-friendly installation process and fast tracking. It also has a face tracker, but in comparison to other computer vision methods it does not have head rotations. This can be problematic as for example for head nodding or shaking it would be best to express such behavior along a rotation of the x (for nodding; also called pitch) or y (for shaking; also called yaw) axes. Also you might want to study head turn directions. Or you may want to make sure there are not head rotations.

Luckily, through some linear algebra applied to the facemesh data you can get this information. Here we will track the face with a 3D facemesh (next to hands and body), but we also add the rotation data in degrees for the nose. This python notebook runs you through the procedure of taking videos as inputs with a single person in the video, and outputting the facemesh coordinates with head rotations added based on a solution proposed overviewed by Susanne Thierfelder.

We make use of the Holistic Tracking by Google's Mediapipe to track the face (and you can uncomment the code to add other modalities), similar to our on envisionBOX masking module. We then implemented the very nice solution of estimating head rotations from the facemesh points.

  • location current Repository: https://github.com/WimPouw/envisionBOX_modulesWP/tree/main/MediaPipeFaceRotation/

  • location Jupyter notebook: https://github.com/WimPouw/envisionBOX_modulesWP/blob/main/MediaPipeFaceRotation/face_tracking_rotations.ipynb

Additional information backbone of the tool (Mediapipe Holistic Tracking)¶

https://google.github.io/mediapipe/solutions/holistic.html

Citation for Mediapipe¶

citation: Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., ... & Grundmann, M. (2019). Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.

Citation for Susanne Thierfelder solution¶

Thierfelder, S. (2024, 19-4). Head Pose Estimation with MediaPipe and OpenCV in JavaScript. Medium. https://medium.com/@susanne.thierfelder/head-pose-estimation-with-mediapipe-and-opencv-in-javascript-c87980df3acb

Modification that is the basis of this tool¶

Our modification here is just the implementation of Susanne Thierfelders solution in the envisionBox workflow by adapting the mediapipe tracking pipeline we already have in envisionBOX. Here we track the hands, body, and facemesh adding head rotations.

Use¶

Make sure to install all the packages in requirements.txt. Then move your videos that you want to mask into the input folder. Then run this code, which will loop through all the videos contained in the input folder; and saves all the results in the output folders.

Please use, improve and adapt as you see fit.

In [2]:
#load in required packages
import mediapipe as mp #mediapipe
import cv2 #opencv
import math #basic operations
import numpy as np #basic operations
import pandas as pd #data wrangling
import csv #csv saving
import os #some basic functions for inspecting folder structure etc.

#list all videos in input_videofolder
from os import listdir
from os.path import isfile, join
#time series output folder
inputfol = "./Input_Videos/"
vfiles = [f for f in listdir(inputfol) if isfile(join(inputfol, f))] #loop through the filenames and collect them in a list
outputf = "./Output_Videos/"
outtputf_ts = "./Output_TimeSeries/"

#check videos to be processed
print("The following folder is set as the output folder where all the pose time series are stored")
print(os.path.abspath(outtputf_ts))
print("\n The following folder is set as the output folder for saving the masked videos ")
print(os.path.abspath(outputf))
print("\n The following video(s) will be processed for masking: ")
print(vfiles)
The following folder is set as the output folder where all the pose time series are stored
C:\Users\u668173\Desktop\wimpouwenvisionboxwp\envisionBOX_modulesWP\MediaPipeFaceRotation\Output_TimeSeries

 The following folder is set as the output folder for saving the masked videos 
C:\Users\u668173\Desktop\wimpouwenvisionboxwp\envisionBOX_modulesWP\MediaPipeFaceRotation\Output_Videos

 The following video(s) will be processed for masking: 
['WIN_20240419_10_25_17_Pro.mp4']
In [3]:
#initialize modules and functions

#load in mediapipe modules
mp_holistic = mp.solutions.holistic
# Import drawing_utils and drawing_styles.
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles

##################FUNCTIONS AND OTHER VARIABLES
#landmarks 33x that are used by Mediapipe (Blazepose)
markersbody = ['NOSE', 'LEFT_EYE_INNER', 'LEFT_EYE', 'LEFT_EYE_OUTER', 'RIGHT_EYE_OUTER', 'RIGHT_EYE', 'RIGHT_EYE_OUTER',
          'LEFT_EAR', 'RIGHT_EAR', 'MOUTH_LEFT', 'MOUTH_RIGHT', 'LEFT_SHOULDER', 'RIGHT_SHOULDER', 'LEFT_ELBOW', 
          'RIGHT_ELBOW', 'LEFT_WRIST', 'RIGHT_WRIST', 'LEFT_PINKY', 'RIGHT_PINKY', 'LEFT_INDEX', 'RIGHT_INDEX',
          'LEFT_THUMB', 'RIGHT_THUMB', 'LEFT_HIP', 'RIGHT_HIP', 'LEFT_KNEE', 'RIGHT_KNEE', 'LEFT_ANKLE', 'RIGHT_ANKLE',
          'LEFT_HEEL', 'RIGHT_HEEL', 'LEFT_FOOT_INDEX', 'RIGHT_FOOT_INDEX']

markershands = ['LEFT_WRIST', 'LEFT_THUMB_CMC', 'LEFT_THUMB_MCP', 'LEFT_THUMB_IP', 'LEFT_THUMB_TIP', 'LEFT_INDEX_FINGER_MCP',
              'LEFT_INDEX_FINGER_PIP', 'LEFT_INDEX_FINGER_DIP', 'LEFT_INDEX_FINGER_TIP', 'LEFT_MIDDLE_FINGER_MCP', 
               'LEFT_MIDDLE_FINGER_PIP', 'LEFT_MIDDLE_FINGER_DIP', 'LEFT_MIDDLE_FINGER_TIP', 'LEFT_RING_FINGER_MCP', 
               'LEFT_RING_FINGER_PIP', 'LEFT_RING_FINGER_DIP', 'LEFT_RING_FINGER_TIP', 'LEFT_PINKY_FINGER_MCP', 
               'LEFT_PINKY_FINGER_PIP', 'LEFT_PINKY_FINGER_DIP', 'LEFT_PINKY_FINGER_TIP',
              'RIGHT_WRIST', 'RIGHT_THUMB_CMC', 'RIGHT_THUMB_MCP', 'RIGHT_THUMB_IP', 'RIGHT_THUMB_TIP', 'RIGHT_INDEX_FINGER_MCP',
              'RIGHT_INDEX_FINGER_PIP', 'RIGHT_INDEX_FINGER_DIP', 'RIGHT_INDEX_FINGER_TIP', 'RIGHT_MIDDLE_FINGER_MCP', 
               'RIGHT_MIDDLE_FINGER_PIP', 'RIGHT_MIDDLE_FINGER_DIP', 'RIGHT_MIDDLE_FINGER_TIP', 'RIGHT_RING_FINGER_MCP', 
               'RIGHT_RING_FINGER_PIP', 'RIGHT_RING_FINGER_DIP', 'RIGHT_RING_FINGER_TIP', 'RIGHT_PINKY_FINGER_MCP', 
               'RIGHT_PINKY_FINGER_PIP', 'RIGHT_PINKY_FINGER_DIP', 'RIGHT_PINKY_FINGER_TIP']
facemarks = [str(x) for x in range(478)] #there are 478 points (see google holistic face mesh info for landmarks)

print("Note that we have the following number of pose keypoints for markers body")
print(len(markersbody))

print("\n Note that we have the following number of pose keypoints for markers hands")
print(len(markershands))

print("\n Note that we have the following number of pose keypoints for markers face")
print(str(len(facemarks )) + '  !!!plus three rotation values!!!')

#set up the column names and objects for the time series data (add time as the first variable)
markerxyzbody = ['time']
markerxyzhands = ['time']
markerxyzface = ['time']

for mark in markersbody:
    for pos in ['X', 'Y', 'Z', 'visibility']: #for markers of the body you also have a visibility reliability score
        nm = pos + "_" + mark
        markerxyzbody.append(nm)
for mark in markershands:
    for pos in ['X', 'Y', 'Z']:
        nm = pos + "_" + mark
        markerxyzhands.append(nm)
for mark in facemarks:
    for pos in ['X', 'Y', 'Z']:
        nm = pos + "_" + mark
        markerxyzface.append(nm)

# we also add rotations x, y, z to the facemarks
markerxyzface = markerxyzface + ['face_rotation_X', 'face_rotation_Y', 'face_rotation_Z']

#check if there are numbers in a string
def num_there(s):
    return any(i.isdigit() for i in s)

#take some google classification object and convert it into a string
def makegoginto_str(gogobj):
    gogobj = str(gogobj).strip("[]")
    gogobj = gogobj.split("\n")
    return(gogobj[:-1]) #ignore last element as this has nothing

#make the stringifyd position traces into clean numerical values
def listpostions(newsamplemarks):
    newsamplemarks = makegoginto_str(newsamplemarks)
    tracking_p = []
    for value in newsamplemarks:
        if num_there(value):
            stripped = value.split(':', 1)[1]
            stripped = stripped.strip() #remove spaces in the string if present
            tracking_p.append(stripped) #add to this list  
    return(tracking_p)
Note that we have the following number of pose keypoints for markers body
33

 Note that we have the following number of pose keypoints for markers hands
42

 Note that we have the following number of pose keypoints for markers face
478  !!!plus three rotation values!!!
In [5]:
# do you want to apply masking?
masking = False
#We will now loop over all the videos that are present in the video file
for vidf in vfiles:
    print("We will now process video:")
    print(vidf)
    print("This is video number" + str(vfiles.index(vidf))+ "of" + str(len(vfiles)) + "videos in total")
    #capture the video, and check video settings
    videoname = vidf
    videoloc = inputfol + videoname
    capture = cv2.VideoCapture(videoloc) #load in the videocapture
    frameWidth = capture.get(cv2.CAP_PROP_FRAME_WIDTH) #check frame width
    frameHeight = capture.get(cv2.CAP_PROP_FRAME_HEIGHT) #check frame height
    samplerate = capture.get(cv2.CAP_PROP_FPS)   #fps = frames per second

    #make an 'empty' video file where we project the pose tracking on
    fourcc = cv2.VideoWriter_fourcc(*'MP4V') #for different video formats you could use e.g., *'XVID'
    out = cv2.VideoWriter(outputf+videoname, fourcc, 
                          fps = samplerate, frameSize = (int(frameWidth), int(frameHeight)))

    # Run MediaPipe frame by frame using Holistic with `enable_segmentation=True` to get pose segmentation.
    time = 0
    tsbody = [markerxyzbody]   #these will be your time series objects, which start with collumn names initialized above
    tshands = [markerxyzhands] #these will be your time series objects, which start with collumn names initialized above
    tsface = [markerxyzface]   #these will be your time series objects, which start with collumn names initialized above

    with mp_holistic.Holistic(
            static_image_mode=False, enable_segmentation=True, refine_face_landmarks=True) as holistic:
        while (True):
            ret, image = capture.read() #read frame
            if ret == True: #if there is a frame
                image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) #make sure the image is in RGB format
                results = holistic.process(image) #apply Mediapipe holistic processing
                # Draw pose segmentation
                h, w, c = image.shape
                face_3d = []
                face_2d = []
                if  np.all(results.segmentation_mask) != None: #check if there is a pose found
                    if masking == False:
                        original_image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
                    if masking == True:
                        original_image = np.concatenate([image, np.full((h, w, 1), 255, dtype=np.uint8)], axis=-1)
                        mask_img = np.zeros_like(image, dtype=np.uint8) #set up basic mask image
                        mask_img[:, :] = (255,255,255) #set up basic mask image
                        segm_2class = 0.2 + 0.8 * results.segmentation_mask #set up a segmentation of the results of mediapipe
                        segm_2class = np.repeat(segm_2class[..., np.newaxis], 3, axis=2) #set up a segmentation of the results of mediapipe
                        annotated_image = mask_img * segm_2class * (1 - segm_2class) #take the basic mask image and make a sillhouette mask
                        # append Alpha channel to sillhouetted mask so that we can overlay it to the original image
                        mask = np.concatenate([annotated_image, np.full((h, w, 1), 255, dtype=np.uint8)], axis=-1)
                        # Zero background where we want to overlay
                        original_image[mask==0]=0 #for the original image we are going to set everything at zero for places where the mask has to go
                        original_image = cv2.cvtColor(original_image, cv2.COLOR_RGB2BGR)
                    #R for the head we also estimate rotations
                    for idx, lm in enumerate(results.face_landmarks.landmark):
                        if idx == 33 or idx == 263 or idx == 1 or idx == 61 or idx == 291 or idx == 199:
                            if idx == 1:
                                nose_2d = (lm.x * w, lm.y * h)
                                nose_3d = (lm.x * w, lm.y * h, lm.z * 3000)
                            x, y = int(lm.x * w), int(lm.y * h)
                            face_2d.append([x, y])
                            face_3d.append([x, y, lm.z])
                    #R Convert it to the NumPy array
                    face_2d = np.array(face_2d, dtype=np.float64)

                    #R Convert it to the NumPy array
                    face_3d = np.array(face_3d, dtype=np.float64)  

                    #R camera matrix
                    focal_length = 1*w
                    cam_matrix =  np.array([ [focal_length, 0, h / 2],
                                    [0, focal_length, w / 2],
                                    [0, 0, 1]])
                    #R The distortion parameters
                    dist_matrix = np.zeros((4, 1), dtype=np.float64)

                    #R Solve the PnP problem
                    success, rot_vec, trans_vec = cv2.solvePnP(face_3d, face_2d, cam_matrix, dist_matrix)

                    #R Get rotational matrix
                    rmat, jac = cv2.Rodrigues(rot_vec)
                    
                    #R Get angles
                    angles, mtxR, mtxQ, Qx, Qy, Qz = cv2.RQDecomp3x3(rmat)

                    # Get the rotation degree
                    xrot = angles[0] * 360 # up and down
                    yrot = angles[1] * 360 # left and right
                    zrot = angles[2] * 360 # tilt
                    rotations = [xrot, yrot, zrot]
                   
                    #R Display the nose direction
                    nose_3d_projection, jacobian = cv2.projectPoints(nose_3d, rot_vec, trans_vec, cam_matrix, dist_matrix)
                    p1 = (int(nose_2d[0]), int(nose_2d[1]))
                    p2 = (int(nose_2d[0] + yrot * 30) , int(nose_2d[1] - xrot * 30))    # the constant integers deterimines the length of the line
                    cv2.line(original_image, p1, p2, (0, 0, 200), 7)
                    # Add the text on the image
                    cv2.putText(original_image, "x pitch/headraise: " + str(np.round(xrot,2)), (450, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 200), 2)
                    cv2.putText(original_image, "y yaw/headturn: " + str(np.round(yrot,2)), (450, 75), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 200), 2)
                    cv2.putText(original_image, "z roll/headtilt: " + str(np.round(zrot,2)), (450, 100), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 200), 2)
                    
                    #now lets draw on the original_image the left and right hand landmarks, the facemesh and the body poses
                        #left hand
                    mp_drawing.draw_landmarks(original_image, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
                        #right hand
                    mp_drawing.draw_landmarks(original_image, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
                        #face
                    mp_drawing.draw_landmarks(
                            original_image,
                            results.face_landmarks,
                            mp_holistic.FACEMESH_TESSELATION,
                            landmark_drawing_spec=None,
                            connection_drawing_spec=mp_drawing_styles
                            .get_default_face_mesh_tesselation_style())
                        #body (lets no draw the body, so that we can see the facemesh better)
                    #mp_drawing.draw_landmarks(
                    #        original_image,
                    #        results.pose_landmarks,
                    #        mp_holistic.POSE_CONNECTIONS,
                    #        landmark_drawing_spec=mp_drawing_styles.
                    #        get_default_pose_landmarks_style())
                    #######################now save everything to a time series
                        #make a variable list with x, y, z, info where data is appended to
                    samplebody = listpostions(results.pose_world_landmarks) # we are saving it in body centered estimates in real world coordinates (meters)
                    samplehands = listpostions([results.left_hand_landmarks, results.right_hand_landmarks])
                    sampleface = listpostions(results.face_landmarks)
                    samplebody.insert(0, time)
                    samplehands.insert(0, time)
                    sampleface.insert(0, time)
                    tsbody.append(samplebody)   #append to the timeseries object
                    tshands.append(samplehands) #append to the timeseries object
                    tsface.append(sampleface)   #append to the timeseries object
                    # add the rotations
                    tsface[-1].extend(rotations)
                #show the video as we process (you can comment this out, if you want to run this process in the background)
                cv2.imshow("resizedimage", original_image)
                out.write(original_image) #save the frame to the new masked video
                time = time+(1000/samplerate)#update the time variable  for the next frame
            if cv2.waitKey(1) == 27: #allow the use of ESCAPE to break the loop
                   break
            if ret == False: #if there are no more frames, break the loop
                break

    #once done de-initialize all processes
    out.release()
    capture.release()
    cv2.destroyAllWindows()
     ####################################################### data to be written row-wise in csv fil
    # opening the csv file in 'w+' mode
    filebody = open(outtputf_ts + vidf[:-4]+'_body.csv', 'w+', newline ='')
    #write it
    with filebody:    
        write = csv.writer(filebody)
        write.writerows(tsbody)
     # opening the csv file in 'w+' mode
    filehands = open(outtputf_ts + vidf[:-4]+'_hands.csv', 'w+', newline ='')
    #write it
    with filehands:
        write = csv.writer(filehands)
        write.writerows(tshands)
    # opening the csv file in 'w+' mode
    fileface = open(outtputf_ts + vidf[:-4]+'_face.csv', 'w+', newline ='')
    #write it
    with fileface:    
        write = csv.writer(fileface)
        write.writerows(tsface)

print("Done with processing all folders; go look in your output folders!")    
We will now process video:
WIN_20240419_10_25_17_Pro.mp4
This is video number0of1videos in total
Done with processing all folders; go look in your output folders!